1) Using GAM and GLM to examine the mortality rates

The Excel document influenza.xlsx contains weekly data on the mortality and the number of laboratory-confirmed cases of influenza in Sweden. In addition, there is information about population-weighted temperature anomalies (temperature deficits).

influenza.xlsx
Year Week Time Mortality Influenza Temperature deficit
1995 1 1995.000 1940 0 1.34518
1995 2 1995.019 1898 0 0.00000
1995 3 1995.038 1812 0 0.00000
1995 4 1995.058 1869 6 1.40262
1995 5 1995.077 1857 2 0.00000
1995 6 1995.096 1802 4 0.00000

1.1)

Task: Use time series plots to visually inspect how the mortality and influenza number vary with time (use Time as X axis). By using this plot, comment how the amounts of influenza cases are related to mortality rates.

Answer: One plot for mortality and one for influenza. To better compare those two and show, if they are related or not, there is a third plot showing them in one plot.

If both plots are superimposed we can observe that it seems like that mortality in influenced by influenza.

1.2)

Task: Use gam() function from mgcv package to fit a GAM model in which Mortality is normally distributed and modelled as a linear function of Year and spline function of Week, and make sure that the model parameters are selected by the generalized cross-validation. Report the underlying probabilistic model.

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Mortality ~ Year + s(Week)
## 
## Estimated degrees of freedom:
## 8.59  total = 10.59 
## 
## GCV score: 9014.609

The probabilistic model looks as follows:

\[ Model = \mathcal{N}(\beta_{year} * X_{Year} +S_{week} * X_{week} + \alpha, \sigma^2) \]

1.3)

Task: Plot predicted and observed mortality against time for the fitted model and comment on the quality of the fit. Investigate the output of the GAM model and report which terms appear to be significant in the model. Is there a trend in mortality change from one year to another? Plot the spline component and interpret the plot.

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Mortality ~ Year + s(Week)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -652.058   3448.379  -0.189     0.85
## Year           1.219      1.725   0.706     0.48
## 
## Approximate significance of smooth terms:
##           edf Ref.df     F p-value    
## s(Week) 8.587  8.951 100.6  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.661   Deviance explained = 66.8%
## GCV = 9014.6  Scale est. = 8806.7    n = 459

Comment: We can see that the fitted data is actually not performing that bad. It recognized the cyclical trend correctly. The drawback of the smoothing is that the extrem values are not correctly predicted, as those are smoothed out.

We observe that the spline curve is following the trend we can observe over the years, with a peak in the beginning/end of the year.

1.4)

Task: Examine how the penalty factor of the spline function in the GAM model from step 2 influences the estimated deviance of the model. Make plots of the predicted and observed mortality against time for cases of very high and very low penalty factors. What is the relation of the penalty factor to the degrees of freedom? Do your results confirm this relationship?

1.5)

Task: Use the model obtained in step 2 and plot the residuals and the influenza values against time (in one plot). Is the temporal pattern in the residuals correlated to the outbreaks of influenza?

Comment: Yes it seems like that the residuals and the outbreaks are correlated. When there is an outbreak, the residuals also inscrease in that particular moment.

1.6)

Task: Fit a GAM model in R in which mortality is be modelled as an additive function of the spline functions of year, week, and the number of confirmed cases of influenza. Use the output of this GAM function to conclude whether or not the mortality is influenced by the outbreaks of influenza. Provide the plot of the original and fitted Mortality against Time and comment whether the model seems to be better than the previous GAM models.

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Mortality ~ Year + s(Week, sp = penalty)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1173.5135  4211.1335   0.279    0.781
## Year           0.3053     2.1067   0.145    0.885
## 
## Approximate significance of smooth terms:
##           edf Ref.df     F p-value    
## s(Week) 1.639  2.036 171.9  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.494   Deviance explained = 49.7%
## GCV =  13263  Scale est. = 13157     n = 459

TODO: Is this model better and does the number of influenza influece the mortality?

2) High-dimensional method

The data file data.csv contains information about 64 e-mails which were manually collected from DBWorld mailing list. They were classified as: ‘announces of conferences’ (1) and ‘everything else’ (0) (variable Conference)

X000euro X5102011 X10th X11th X12noon X12th X13th X14th X15th X16th
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0

2.1)

  • Divide data into training and test sets (70/30) without scaling.
  • Perform nearest shrunken centroid classification of training data in which the threshold is chosen by cross-validation.
  • Provide a centroid plot and interpret it. How many features were selected by the method?
  • List the names of the 10 most contributing features and comment whether it is reasonable that they have strong effect on the discrimination between the conference mails and other mails?
  • Report the test error.

## Call:
## pamr.cv(fit = nsc_model, data = mydata)
##    threshold nonzero errors
## 1  0.0       3428    6     
## 2  0.1       3409    6     
## 3  0.2       3114    7     
## 4  0.3       3024    7     
## 5  0.4       3000    7     
## 6  0.5       1979    6     
## 7  0.6        852    6     
## 8  0.7        841    6     
## 9  0.8        673    6     
## 10 0.9        622    6     
## 11 1.0        297    6     
## 12 1.1        293    6     
## 13 1.2        272    6     
## 14 1.3        231    5     
## 15 1.4        170    5     
## 16 1.5        138    6     
## 17 1.6        129    7     
## 18 1.7         98    7     
## 19 1.8         88    7     
## 20 1.9         71    7     
## 21 2.0         62    7     
## 22 2.1         47    7     
## 23 2.2         43    7     
## 24 2.3         36    7     
## 25 2.4         30    7     
## 26 2.5         20    7     
## 27 2.6         20    7     
## 28 2.7         14    7     
## 29 2.8         12    7     
## 30 2.9         12    9     
## 31 3.0         12    9     
## 32 3.1         11    9     
## 33 3.2          9    10    
## 34 3.3          9    11    
## 35 3.4          6    11    
## 36 3.5          6    13    
## 37 3.6          6    14    
## 38 3.7          6    16    
## 39 3.8          4    16    
## 40 3.9          2    20    
## 41 4.0          1    20

2.2)

Compute the test error and the number of the contributing features for the following methods fitted to the training data:

  • Elastic net with the binomial response and a = 0.5 in which penalty is selected by the cross-validation
  • Support vector machine with “vanilladot”" kernel.

Compare the results of these models with the results of the nearest shrunken centroids (make a comparative table). Which model would you prefer and why?

##  Setting default kernel parameters
## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.
## 
## Call:  glmnet(x = x, y = y, family = "binomial", alpha = 0.5) 
## 
##         Df       %Dev   Lambda
##   [1,]   0 -1.624e-16 0.668200
##   [2,]   1  1.301e-02 0.637800
##   [3,]   3  3.387e-02 0.608800
##   [4,]   6  5.984e-02 0.581100
##   [5,]   6  8.816e-02 0.554700
##   [6,]   6  1.149e-01 0.529500
##   [7,]   7  1.403e-01 0.505400
##   [8,]   7  1.657e-01 0.482500
##   [9,]   9  1.900e-01 0.460500
##  [10,]   9  2.141e-01 0.439600
##  [11,]  10  2.375e-01 0.419600
##  [12,]  10  2.596e-01 0.400600
##  [13,]  10  2.805e-01 0.382300
##  [14,]  11  3.012e-01 0.365000
##  [15,]  12  3.216e-01 0.348400
##  [16,]  12  3.409e-01 0.332500
##  [17,]  12  3.592e-01 0.317400
##  [18,]  13  3.775e-01 0.303000
##  [19,]  13  3.952e-01 0.289200
##  [20,]  15  4.123e-01 0.276100
##  [21,]  17  4.323e-01 0.263500
##  [22,]  20  4.524e-01 0.251600
##  [23,]  24  4.737e-01 0.240100
##  [24,]  25  4.951e-01 0.229200
##  [25,]  26  5.156e-01 0.218800
##  [26,]  27  5.354e-01 0.208800
##  [27,]  27  5.544e-01 0.199400
##  [28,]  28  5.724e-01 0.190300
##  [29,]  29  5.895e-01 0.181600
##  [30,]  32  6.062e-01 0.173400
##  [31,]  32  6.221e-01 0.165500
##  [32,]  33  6.373e-01 0.158000
##  [33,]  34  6.518e-01 0.150800
##  [34,]  35  6.657e-01 0.144000
##  [35,]  36  6.790e-01 0.137400
##  [36,]  38  6.916e-01 0.131200
##  [37,]  39  7.037e-01 0.125200
##  [38,]  40  7.152e-01 0.119500
##  [39,]  39  7.261e-01 0.114100
##  [40,]  39  7.365e-01 0.108900
##  [41,]  40  7.464e-01 0.103900
##  [42,]  41  7.559e-01 0.099220
##  [43,]  42  7.649e-01 0.094710
##  [44,]  43  7.734e-01 0.090410
##  [45,]  43  7.816e-01 0.086300
##  [46,]  45  7.894e-01 0.082370
##  [47,]  45  7.968e-01 0.078630
##  [48,]  50  8.040e-01 0.075060
##  [49,]  51  8.108e-01 0.071640
##  [50,]  51  8.173e-01 0.068390
##  [51,]  52  8.235e-01 0.065280
##  [52,]  53  8.294e-01 0.062310
##  [53,]  54  8.351e-01 0.059480
##  [54,]  54  8.405e-01 0.056780
##  [55,]  55  8.456e-01 0.054200
##  [56,]  57  8.505e-01 0.051730
##  [57,]  57  8.552e-01 0.049380
##  [58,]  57  8.596e-01 0.047140
##  [59,]  60  8.639e-01 0.045000
##  [60,]  68  8.680e-01 0.042950
##  [61,]  72  8.719e-01 0.041000
##  [62,]  74  8.756e-01 0.039130
##  [63,]  77  8.791e-01 0.037360
##  [64,]  80  8.825e-01 0.035660
##  [65,]  80  8.858e-01 0.034040
##  [66,]  80  8.889e-01 0.032490
##  [67,]  80  8.918e-01 0.031010
##  [68,]  80  8.946e-01 0.029600
##  [69,]  83  8.973e-01 0.028260
##  [70,]  84  8.998e-01 0.026970
##  [71,]  85  9.023e-01 0.025750
##  [72,]  85  9.046e-01 0.024580
##  [73,]  86  9.068e-01 0.023460
##  [74,]  86  9.089e-01 0.022390
##  [75,]  86  9.109e-01 0.021380
##  [76,]  86  9.128e-01 0.020400
##  [77,]  86  9.147e-01 0.019480
##  [78,]  88  9.164e-01 0.018590
##  [79,]  88  9.181e-01 0.017750
##  [80,]  88  9.197e-01 0.016940
##  [81,] 108  9.212e-01 0.016170
##  [82,] 123  9.227e-01 0.015440
##  [83,] 128  9.241e-01 0.014730
##  [84,] 129  9.254e-01 0.014060
##  [85,] 129  9.267e-01 0.013420
##  [86,] 143  9.279e-01 0.012810
##  [87,] 150  9.291e-01 0.012230
##  [88,] 156  9.302e-01 0.011680
##  [89,] 159  9.312e-01 0.011150
##  [90,] 161  9.322e-01 0.010640
##  [91,] 173  9.332e-01 0.010160
##  [92,] 188  9.341e-01 0.009694
##  [93,] 189  9.350e-01 0.009253
##  [94,] 190  9.359e-01 0.008833
##  [95,] 192  9.367e-01 0.008431
##  [96,] 183  9.374e-01 0.008048
##  [97,] 186  9.382e-01 0.007682
##  [98,] 188  9.389e-01 0.007333
##  [99,] 191  9.395e-01 0.007000
## [100,] 191  9.402e-01 0.006682
##            Length Class     Mode     
## a0            100 -none-    numeric  
## beta       470200 dgCMatrix S4       
## df            100 -none-    numeric  
## dim             2 -none-    numeric  
## lambda        100 -none-    numeric  
## dev.ratio     100 -none-    numeric  
## nulldev         1 -none-    numeric  
## npasses         1 -none-    numeric  
## jerr            1 -none-    numeric  
## offset          1 -none-    logical  
## classnames      2 -none-    character
## call            5 -none-    call     
## nobs            1 -none-    numeric

2.3)

Implement Benjamini-Hochberg method for the original data, and use t.test() for computing p-values. Which features correspond to the rejected hypotheses? Interpret the result.

feature_names p_values cv_scores
papers 0.0000000 0.0000106
submission 0.0000000 0.0000213
position 0.0000000 0.0000319
published 0.0000002 0.0000425
important 0.0000003 0.0000532
call 0.0000004 0.0000638
conference 0.0000005 0.0000744
candidates 0.0000009 0.0000851
dates 0.0000014 0.0000957
paper 0.0000014 0.0001063
topics 0.0000051 0.0001170
limited 0.0000079 0.0001276
candidate 0.0000119 0.0001382
camera 0.0000210 0.0001489
ready 0.0000210 0.0001595
authors 0.0000215 0.0001701
phd 0.0000338 0.0001808
projects 0.0000350 0.0001914
org 0.0000374 0.0002020
chairs 0.0000586 0.0002127
due 0.0000649 0.0002233
original 0.0000649 0.0002339
notification 0.0000688 0.0002446
salary 0.0000797 0.0002552
record 0.0000909 0.0002658
skills 0.0000909 0.0002765
held 0.0001529 0.0002871
team 0.0001758 0.0002977
pages 0.0002007 0.0003084
workshop 0.0002007 0.0003190
committee 0.0002117 0.0003296
proceedings 0.0002117 0.0003403
apply 0.0002166 0.0003509
strong 0.0002246 0.0003615
international 0.0002296 0.0003722
degree 0.0003762 0.0003828
excellent 0.0003762 0.0003934
post 0.0003762 0.0004041
presented 0.0003765 0.0004147

Appendix

knitr::opts_chunk$set(echo = TRUE)
library(readxl)
library(ggplot2)
library(knitr)
library(mgcv)
library(pamr)
library(kernlab)
library(glmnet)
set.seed(1234567890)


################################################################################
# 1) Using GAM and GLM to examine the mortality rates
################################################################################

set.seed(12345)
influanza = read_excel("./influenza.xlsx")
#creditscoring$good_bad = as.factor(creditscoring$good_bad)
kable(head(influanza), caption = "influenza.xlsx")


################################################################################
# 1.1) 
################################################################################

ggplot(influanza) +
  geom_line(aes(x = influanza$Time, y = influanza$Mortality,
                colour = "Mortality Time Series")) +
  labs(title = "Mortality", y = "Mortality",
       x = "Time", color = "Legend") +
  scale_color_manual(values = c("blue"))


ggplot(influanza) +
  geom_line(aes(x = influanza$Time, y = influanza$Influenza,
                colour = "Influenza Time Series")) +
  labs(title = "Influenza", y = "Influenza",
       x = "Time", color = "Legend") +
  scale_color_manual(values = c("orange"))

ggplot(influanza) +
  geom_line(aes(x = influanza$Time, y = influanza$Mortality,
                colour = "Mortality Time Series")) +
  geom_line(aes(x = influanza$Time, y = influanza$Influenza,
                colour = "Influenza Time Series")) +
  labs(title = "Superimposed Time Series", y = "Values",
       x = "Time", color = "Legend") +
  scale_color_manual(values = c("red", "brown"))

################################################################################
# 1.2) 
################################################################################

# - Mortality is normally distributed and modelled as...
# - ...a linear function of Year...
# - ...and spline function of Week
# Don't forget to select the model bei generalized cross-validation

gam_model = gam(formula = Mortality ~ Year + s(Week), family = gaussian(), data = influanza, method="GCV.Cp") 

print(gam_model)


summary(gam_model)
plot(gam_model)

time = influanza$Time
observed = influanza$Mortality
predicted = gam_model$fitted.values

mortality_values = data.frame(time, observed, predicted)

ggplot(mortality_values) +
  geom_line(aes(x = time, y = observed,
                colour = "Observed Mortality")) +
  geom_line(aes(x = time, y = predicted,
                colour = "Predicted Mortality")) +
  labs(title = "Observed vs Predicted Mortality", y = "Mortality",
       x = "Time", color = "Legend") +
  scale_color_manual(values = c("blue", "orange"))


ggplot() +
  geom_line(aes(x=influanza$Week,y=influanza$Mortality,
                         colour=as.factor(influanza$Year))) +
  labs(x='Week', y='Mortality', colour='Year',
       title='Mortality by Week and by Year')


penalties = seq(from = 0 , to = 20, by = 0.1)
index = 1
fitted_gam_with_penalties = data.frame()

for (penalty in penalties) {
  gam_model = gam(formula = Mortality ~ Year + s(Week, sp = penalty),
                  family = gaussian(), data = influanza, method="GCV.Cp") 
  
  fitted_gam_with_penalties = rbind(fitted_gam_with_penalties,
        data.frame(list(penalty = penalty,
                        deviance = gam_model$deviance,
                        df = sum(influence(gam_model)))))
}

# Deviance VS Penalty

ggplot(fitted_gam_with_penalties) +
  geom_line(aes(x = fitted_gam_with_penalties$penalty,
                y = fitted_gam_with_penalties$deviance,
                colour = "Deviance")) +
  labs(title = "Deviance VS Penalty", y = "Deviance",
       x = "Penalty", color = "Legend") +
  scale_color_manual(values = c("blue", "orange"))

## Create Models for High and Low penalties
gam_model_low_pen = gam(formula = Mortality ~ Year + s(Week, sp = 0.1),
                  family = gaussian(), data = influanza, method="GCV.Cp") 
gam_model_high_pen = gam(formula = Mortality ~ Year + s(Week, sp = 20),
                  family = gaussian(), data = influanza, method="GCV.Cp") 

high_low_df = data.frame()
high_low_df = rbind(high_low_df,
                    list(mortality = influanza$Mortality,
                          mortality_low = fitted(gam_model_low_pen),
                          mortality_high = fitted(gam_model_high_pen),
                          date = influanza$Time))

ggplot(high_low_df) +
  geom_line(aes(x = date, y = mortality, colour = "Mortality")) +
  geom_line(aes(x = date, y = mortality_low, colour = "Mortality (Low)")) +
  geom_line(aes(x = date, y = mortality_high, colour = "Mortality (High)")) +
  labs(title = "Mortalities vs Time", y = "Mortality",
       x = "Time", color = "Legend") +
  scale_color_manual(values = c("blue", "orange", "violet"))

ggplot(mortality_values) +
  geom_line(aes(x = time, y = observed,
                colour = "Observed Mortality")) +
  geom_line(aes(x = time, y = predicted,
                colour = "Predicted Mortality")) +
  labs(title = "Observed vs Predicted Mortality", y = "Mortality",
       x = "Time", color = "Legend") +
  scale_color_manual(values = c("blue", "orange"))


time = influanza$Time
resid = gam_model$residuals
influenza_data = influanza$Influenza

print_data = data.frame(time, resid, influenza_data)

ggplot(print_data) +
  geom_line(aes(x = time, y = resid,
                colour = "Residuals")) +
  geom_line(aes(x = time, y = influenza_data,
                colour = "Influenza")) +
  labs(title = "Residuals and Influenza vs Time", y = "Residuals and Mortality",
       x = "Time", color = "Legend") +
  scale_color_manual(values = c("blue", "orange"))


gam_model_additive = gam(formula = Mortality ~ Year + s(Week) + Influenza, family = gaussian(), data = influanza, method="GCV.Cp")

summary(gam_model)
plot(gam_model)

time = influanza$Time
observed = influanza$Mortality
predicted = gam_model_additive$fitted.values

mortality_values = data.frame(time, observed, predicted)

ggplot(mortality_values) +
  geom_line(aes(x = time, y = observed,
                colour = "Observed Mortality")) +
  geom_line(aes(x = time, y = predicted,
                colour = "Predicted Mortality")) +
  labs(title = "Observed vs Predicted Mortality", y = "Mortality",
       x = "Time", color = "Legend") +
  scale_color_manual(values = c("blue", "orange"))


################################################################################
# 2) High-dimensional method
################################################################################

emails = read.csv2("data.csv", fileEncoding = "ISO-8859-1", sep = ";")
emails$Conference = as.factor(emails$Conference)
kable(head(emails[,1:10]))

n = dim(emails)[1]
id = sample(1:n, floor(n*0.70))
train_emails = emails[id,]
val_emails = emails[-id,]


################################################################################
# 2.1)
################################################################################

rownames(train_emails) = 1:nrow(train_emails)
x = t(train_emails[,-ncol(train_emails)])
y = train_emails[[ncol(train_emails)]]
mydata = list(x=x ,y=as.factor(y), geneid = as.character(1:nrow(x)),
              genenames = rownames(x))

nsc_model = pamr.train(mydata, threshold=seq(0, 4, 0.1))
nsc_model_cv = pamr.cv(nsc_model, mydata)


pamr.plotcen(nsc_model, mydata, threshold = 1)
pamr.plotcen(nsc_model, mydata, threshold = 2.5)

print(nsc_model_cv)
pamr.plotcv(nsc_model_cv)


# Elastic Net

# Predictor Variables. The -1 removes the intercept component
x = model.matrix(Conference ~ ., train_emails)[,-1]

# Outcome variable
y = train_emails$Conference

model_elastic_net = glmnet(x, y, alpha = 0.5, family = "binomial")

# Support vector machine with "vanilladot" kernel.
# Vanilladot means "Linear Kernel Function"
#model_svm = svm(x = x, y = y, kernel = "linear")
model_svm = ksvm(Conference ~ ., train_emails, kernel = "vanilladot")


print(model_elastic_net)
summary(model_elastic_net)
plot(model_elastic_net)


# Orignal Data Set: emails
# H0: Feature has effect on Conference
# Ha: Feature has no effect on Conference

# 1) Calculate p-values
# 2) Assign ranks and sort
# 3) BH Critical Value
# 4) Find critical value and select all p < score

q_value = 0.05

## 1)

feature_names = c()
p_values = c()

for (i in 1:(ncol(emails)-1)) {
  colname = colnames(emails)[i]
  c_formula = paste(sep = "", colname, " ~ Conference")
  # one.sided, less, greater
  p_value = t.test(as.formula(c_formula), data = emails, alternative="two.sided")
  
  feature_names = c(feature_names, colname)
  p_values = c(p_values, p_value$p.value)
}

## 2)

bh_entries = data.frame(feature_names, p_values)
bh_entries = bh_entries[order(bh_entries$p_values, decreasing = FALSE),]
rownames(bh_entries) = NULL


## 3)

getCV_BH = function(i, m, Q) {
  return(i/m*Q)
}

cv_scores = c()

for (j in 1:nrow(bh_entries)) {
  current_val = getCV_BH(as.numeric(rownames(bh_entries)[j]), nrow(bh_entries), q_value)
  cv_scores = c(cv_scores, current_val)
}

#cv_col = apply(bh_entries, 1, function(x) getCV_BH(as.numeric(rownames(bh_entries)), nrow(bh_entries), q_value))
bh_entries = data.frame(bh_entries, cv_scores)

## 4)
## Find all rows where p_values < cv_scores
bh_entries = bh_entries[bh_entries$p_values < bh_entries$cv_scores,]

kable(bh_entries)

Bibliography